2025-01-09
Possible fixes (that don’t require throwing out data)
Clustered standard errors: use either the bootstrap or a sandwich estimator to adjust the standard errors upward
Fixed effects: make a dummy for each group and include it in the model, (potentially) eliminating the autocorrelation while also controlling for unobserved differences across groups
Random effects
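The bootstrap version of the first fix can be sketched in base R: resample whole groups rather than individual rows, so the within-group correlation survives into every bootstrap sample. The data and all names below are simulated and hypothetical, not from the lecture.

```r
# Cluster (block) bootstrap for standard errors -- simulated data
set.seed(42)
n_groups <- 20
dat <- data.frame(
  group = rep(seq_len(n_groups), each = 10),
  u     = rep(rnorm(n_groups), each = 10)   # shared group-level shock
)
dat$x <- rep(rnorm(n_groups), each = 10) + rnorm(nrow(dat))  # x clusters by group too
dat$y <- 1 + 0.5 * dat$x + dat$u + rnorm(nrow(dat))

# Resample whole groups, not rows, so the clustering is preserved
boot_slopes <- replicate(500, {
  sampled  <- sample(unique(dat$group), n_groups, replace = TRUE)
  boot_dat <- do.call(rbind, lapply(sampled, function(g) dat[dat$group == g, ]))
  coef(lm(y ~ x, data = boot_dat))[["x"]]
})

naive_se     <- summary(lm(y ~ x, data = dat))$coefficients["x", "Std. Error"]
clustered_se <- sd(boot_slopes)
```

With group-level shocks like these, the clustered standard error will typically be larger than the naive OLS one; that gap is the over-confidence the fix is meant to correct.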
(the terminology here is inconsistent and contested, but no one has come up with better words yet)
Fixed Effects Models give every group its own y-intercept. The effects of “Texas” vs. “Tennessee” are assumed to be fixed and unrelated.
Random Effects Models assume that group intercepts are random draws from a common distribution. They can vary just like any other random variable, but they vary in a predictable way.
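As a concrete sketch of the fixed-effects version (simulated data; all numbers here are made up), each group gets its own dummy and hence its own intercept:

```r
# Fixed effects via group dummies -- simulated data
set.seed(1)
state <- rep(c("Texas", "Tennessee"), each = 50)
intercepts <- ifelse(state == "Texas", 2, -1)   # each group gets its own level
x <- rnorm(100)
y <- intercepts + 0.5 * x + rnorm(100, sd = 0.2)

# factor() expands into one dummy per group; each dummy is a fixed parameter
fe_model <- lm(y ~ x + factor(state))
coef(fe_model)
```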
You want to estimate the effect of time spent studying on student achievement
But you also want to account for differences between teachers: you suspect some are slightly better than others.
```mermaid
flowchart LR
    Z[School] --> A
    Z --> G
    Z --> L
    A[Classroom 1] --> B(Student A)
    A --> C(Student B)
    A --> E(Student D)
    G[Classroom 2] --> H(Student H)
    G --> I(Student I)
    G --> K(Student K)
    L[Classroom 3] --> M(Student M)
```
You calculate some averages and your results look like this:
| Instructor | Number of students | Average SAT score |
|---|---|---|
| Teacher 1 | 50 | 1029 |
| Teacher 2 | 25 | 1020 |
| Teacher 3 | 5 | 1120 |
Under a fixed effects assumption, each teacher’s effect is a fixed parameter, and knowing the mean scores for teachers 1 and 2 doesn’t tell you anything about what the mean score should be for teacher 3.
Under a random effects assumption, the teacher effect is drawn from a larger distribution that has a central tendency and a variance. So you might look at the value for teacher 3 and conclude: “this seems implausible; in reality, this effect should probably be closer to the effect for the other two.”
So we could assume something like this:
\[ SAT = \text{teacher effect} + b_{\text{studying}} \times \text{study time} + \epsilon \]
\[ \epsilon \sim \mathcal{N}(0, \sigma) \]
But we could instead assume a model where the “teacher effect” also has a normal distribution, like the error term:
\[ SAT = (\text{teacher effect}) + b_{\text{studying}} \times \text{study time} + \epsilon \]
\[ \text{teacher effect} \sim \mathcal{N}(0, \tau) \]
In the second case, we estimate a model that slightly biases teacher effects toward the grand mean (a form of shrinkage).
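That shrinkage can be computed directly. Using the class sizes and averages from the table above, plus made-up values for the within-teacher SD (sigma) and between-teacher SD (tau), each random-effects estimate is a weighted average of the teacher’s own mean and the grand mean:

```r
# Shrinkage toward the grand mean; sigma and tau are hypothetical values
n     <- c(50, 25, 5)          # students per teacher (from the table)
means <- c(1029, 1020, 1120)   # average SAT per teacher (from the table)
sigma <- 200                   # assumed within-teacher SD
tau   <- 30                    # assumed between-teacher SD

grand  <- weighted.mean(means, n)
w      <- tau^2 / (tau^2 + sigma^2 / n)  # more students => more weight on own mean
shrunk <- grand + w * (means - grand)
round(shrunk, 1)  # teacher 3 is pulled from 1120 to roughly 1041
```

Teacher 3, with only five students, gets the heaviest shrinkage; teachers 1 and 2 barely move.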
Fixed Effects models are best when:
You want to control for un-modeled differences across groups
You think groups are fundamentally “different”
You don’t need to estimate the effect of any group-level characteristics (the dummies absorb them)
Random Effects models are preferable when:
There are lots of small groups with very few members
You’re less worried about unmeasured heterogeneity between clusters
You want to include characteristics across different “levels”, or even allow the slopes to vary for certain groups
OLS assumes that data are independent, but your height today is not independent of your height yesterday.
autocorrelation: a series correlates with its own past values.
Current values often depend on prior values. The democracy score of Paraguay in 2000 is a good predictor of current democracy scores in Paraguay.
Lack of independence poses similar problems to what we observed with clustered samples: we will generally overstate our level of confidence in estimates. We actually have very few truly independent observations here.
Time trends can also be a source of spuriousness. Lots of things increase over time, so unrelated trending variables will all look significantly related in a naive model.
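A quick simulation (all made up) of that spuriousness: two series that share nothing but an upward trend look strongly, and “significantly”, related.

```r
# Two independent series with the same deterministic trend
set.seed(7)
time_idx <- 1:100
y1 <- time_idx + rnorm(100, sd = 5)
y2 <- time_idx + rnorm(100, sd = 5)

trend_cor <- cor(y1, y2)  # very high despite no causal link
trend_p   <- summary(lm(y1 ~ y2))$coefficients["y2", "Pr(>|t|)"]  # tiny
```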
Failure to account for cycles can also just lead to poor prediction in the face of complex trends. The overall trend in total airline flights looks like this:
```r
library(dplyr)
library(lubridate)
library(ggplot2)

flights <- readRDS("C:/Users/neilb/Documents/GVPT728_Winter24/flights.RData") |>
  mutate(date = as.Date(date),
         year = year(date),
         quarter = quarter(date))

ggplot(flights, aes(x = date, y = total_flights)) +
  geom_line() +
  theme_minimal() +
  geom_smooth(method = 'lm', se = FALSE) +
  ylab("total flights") +
  xlab("date")
```

… but we do a poor job of capturing that if we’re only looking at a few years of data:
We could treat time as a factor and include it as a fixed effect (but we can’t do time AND case fixed effects, because then nothing varies!)
We could treat time as a random effect term
We could control for time linearly or with a polynomial or spline function
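A sketch of the last option on simulated data (all names hypothetical): x trends with time, y follows a curved time trend, and there is no true x effect. A naive regression finds one anyway; adding a polynomial in time absorbs the trend.

```r
# Controlling for a time trend with a polynomial -- simulated data
set.seed(3)
year <- 1:60
x <- year + rnorm(60)           # x drifts upward over time
y <- year^2 / 100 + rnorm(60)   # y has a curved trend, no true x effect

naive      <- lm(y ~ x)                  # picks up the shared trend
controlled <- lm(y ~ x + poly(year, 2))  # polynomial in time absorbs it

naive_x      <- coef(naive)[["x"]]
controlled_x <- coef(controlled)[["x"]]
```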
If data depends on prior values, taking the first difference should address it
difference = current_value - prior_value
```r
vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         group = 'differenced') |>
  bind_rows(vdem |>
              filter(country_name == "Paraguay") |>
              mutate(group = 'original',
                     polyarchy = v2x_polyarchy)) |>
  ggplot(aes(x = year, y = polyarchy, color = group)) +
  geom_line() +
  facet_wrap(~group) +
  theme_minimal() +
  ylab('Paraguay democracy score')
```

Taking the difference of both polyarchy and GDP in this model should eliminate the time trend effect. But it also alters what we’re actually regressing.
```r
differenced_model <- vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(diff_polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))) |>
  filter(year >= 1995) |>
  lm(diff_polyarchy ~ diff_gdp, data = _)

# `model` is the original levels regression of polyarchy on log GDP
huxreg("original_model" = model, "differenced_model" = differenced_model)
```

|  | original_model | differenced_model |
|---|---|---|
| (Intercept) | 0.388 *** | 0.002 |
|  | (0.057) | (0.003) |
| log(e_gdppc) | 0.100 ** |  |
|  | (0.030) |  |
| diff_gdp |  | 0.021 |
|  |  | (0.104) |
| N | 25 | 25 |
| R2 | 0.328 | 0.002 |
| logLik | 53.157 | 74.457 |
| AIC | -100.315 | -142.914 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
A change in the rate of change of GDP will lead to a change in the rate of change of the democracy score.
Pro: can account for the primary time series problem without adding any new parameters
Con: is actually a different model! Estimates changes, not levels!
Con: transforms the data in a way that makes it difficult to get predictions
Another option is to include the lagged dependent variable itself as a covariate. This is typically referred to as an autoregressive model.
One advantage of this approach is that you can simultaneously test multiple lags, which helps if you think there’s more than one source of trend.
```r
ar_model2 <- vdem |>
  filter(country_name == 'Paraguay') |>
  mutate(lag_polyarchy = dplyr::lag(v2x_polyarchy),
         lag_gdp = dplyr::lag(e_gdppc),
         lag_gdp2 = dplyr::lag(e_gdppc, 2),
         lag_polyarchy2 = dplyr::lag(v2x_polyarchy, 2)) |>
  filter(year >= 1995) |>
  lm(v2x_polyarchy ~ lag_polyarchy + lag_polyarchy2 +
       log(e_gdppc) + log(lag_gdp) + log(lag_gdp2),
     data = _)

huxreg(ar_model2)
```

|  | (1) |
|---|---|
| (Intercept) | 0.114 * |
|  | (0.044) |
| lag_polyarchy | 1.172 *** |
|  | (0.210) |
| lag_polyarchy2 | -0.362 |
|  | (0.222) |
| log(e_gdppc) | 0.143 |
|  | (0.126) |
| log(lag_gdp) | -0.010 |
|  | (0.209) |
| log(lag_gdp2) | -0.138 |
|  | (0.128) |
| N | 25 |
| R2 | 0.921 |
| logLik | 79.965 |
| AIC | -145.930 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Pro: allows a much more complex model specification and also tests for autocorrelation
Con: interpretation is weird, and it adds a lot of parameters and inevitable multicollinearity
Including a fixed effect for time can work if you have multiple observations for different periods or if you only suspect a cyclical trend
More complicated autocorrelation structures can be modeled with a regression, and then OLS can be used on the residuals from that model
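One classic version of that idea is a Cochrane-Orcutt-style correction, sketched here on simulated AR(1) errors (not the lecture’s data): fit OLS, estimate the autocorrelation from the residuals, quasi-difference the data with it, and re-fit.

```r
# Cochrane-Orcutt-style sketch on simulated AR(1) disturbances
set.seed(11)
n <- 200
e <- as.numeric(arima.sim(list(ar = 0.8), n = n))  # AR(1) errors
x <- rnorm(n)
y <- 1 + 0.5 * x + e

ols <- lm(y ~ x)
res <- resid(ols)
rho <- unname(coef(lm(res[-1] ~ res[-n] - 1)))  # AR(1) coefficient of residuals

# Quasi-difference both sides, then rerun OLS on the transformed series
y_star <- y[-1] - rho * y[-n]
x_star <- x[-1] - rho * x[-n]
corrected <- lm(y_star ~ x_star)
```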
ACF plots can be used to visualize autocorrelations. If a lagged correlation extends beyond the blue confidence bounds, there’s statistically meaningful autocorrelation at that lag.
This can make it much easier to spot seasonal trends as well:
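A minimal sketch of computing those autocorrelations with acf() on simulated series (the bounds drawn on the plot are approximately ±2/sqrt(n)):

```r
set.seed(5)
ar_series <- as.numeric(arima.sim(list(ar = 0.7), n = 300))  # autocorrelated
wn_series <- rnorm(300)                                      # white noise

a_ar <- acf(ar_series, plot = FALSE)  # plot = TRUE draws the usual ACF plot
a_wn <- acf(wn_series, plot = FALSE)

a_ar$acf[2]  # lag-1 autocorrelation, far outside the bounds
a_wn$acf[2]  # near zero for white noise
```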